Statistical learning: basics and regression problems

MACS 30500
University of Chicago

February 6, 2017

What is statistical learning?

Functional form

\[Y = f(X) + \epsilon\]

  • Statistical learning refers to the set of approaches for estimating \(f\)

Linear functional form

Why estimate \(f\)?

  • Prediction
  • Inference
  • How do we estimate \(f\)?
    • Parametric methods
    • Non-parametric methods

Parametric methods

  1. First make an assumption about the functional form of \(f\)
  2. After a model has been selected, fit or train the model using the actual data

OLS

Parametric methods

\[Y = \beta_0 + \beta_{1}X_1\]

  • \(Y =\) sales
  • \(X_{1} =\) advertising spending in a given medium
  • \(\beta_0 =\) intercept
  • \(\beta_1 =\) slope

Non-parametric methods

  • No assumptions about functional form
  • Use data to estimate \(f\) directly
    • Get close to data points
    • Avoid overcomplexity
  • Requires large amount of observations

LOESS

Types of learning

  • Supervised
  • Unsupervised

Statistical learning vs. machine learning

  • Statistical learning
    • Subfield of statistics
    • Focused predominantly on inference
  • Machine learning
    • Subfield of computer science
    • Focused predominantly on prediction

Linear models

\[y = \beta_0 + \beta_1 * x\]

Linear models

Least squares regression

Estimating a linear model using lm()

sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
## (Intercept)           x 
##    4.220822    2.051533

str(lm())

str(sim1_mod)
## List of 12
##  $ coefficients : Named num [1:2] 4.22 2.05
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ residuals    : Named num [1:30] -2.072 1.238 -4.147 0.665 1.919 ...
##   ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:30] -84.92 32.275 -4.13 0.761 2.015 ...
##   ..- attr(*, "names")= chr [1:30] "(Intercept)" "x" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:30] 6.27 6.27 6.27 8.32 8.32 ...
##   ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:30, 1:2] -5.477 0.183 0.183 0.183 0.183 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:30] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.18 1.24
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 28
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = y ~ x, data = sim1)
##  $ terms        :Classes 'terms', 'formula'  language y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ model        :'data.frame':   30 obs. of  2 variables:
##   ..$ y: num [1:30] 4.2 7.51 2.13 8.99 10.24 ...
##   ..$ x: int [1:30] 1 1 1 2 2 2 3 3 3 4 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language y ~ x
##   .. .. ..- attr(*, "variables")= language list(y, x)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. .. ..$ : chr "x"
##   .. .. ..- attr(*, "term.labels")= chr "x"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(y, x)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  - attr(*, "class")= chr "lm"

Predicted values

Residuals

Estimating linear models using gapminder

Go here.

Exercise: linear regression with scorecard

Complete this exercise

Acknowledgments